Goal: Know how to make decisions and answer questions using clustering.

Repeat the clustering process only using the Rep house votes dataset. What differences and similarities did you see between how the clustering worked for the datasets?

There were both similarities and differences. The most prominent similarity is that 2 centers were used to cluster both datasets. In addition, the share of total variance accounted for by the 2 clusters (between-cluster sum of squares divided by total sum of squares) was high for both datasets; it was 79.5% for the Republican data.

The differences were more noticeable. The elbow graphs differed slightly: the Democratic elbow graph suggested that 3 clusters may have been a better choice than 2, while the Republican elbow graph pointed to 2 clusters as the best choice. The first set of plots also differed. The Democratic plot showed more overlap: there were blue (Democratic) points in the Republican cluster, and there were more outliers. The Republican plot had far fewer outliers and even fewer miscolored points. This suggests that party-line voting was stronger on Republican-introduced bills.

#Select the variables to be included in the cluster 
house_votes_Rep = read_csv("C:/Users/Maddie/OneDrive/Desktop/3YEAR/Forked-DS-3001/06_Clustering/house_votes_Rep.csv")
## Rows: 427 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Last.Name, party.labels
## dbl (3): aye, nay, other
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
table(house_votes_Rep$party.labels)
## 
##   Democrat Republican 
##        198        229
View(house_votes_Rep)

# Define the columns to be clustered by sub-setting the data.
# Placing the vector of columns after the comma inside the 
# brackets tells R that you are selecting columns.
clust_data_Rep = house_votes_Rep[, c("aye", "nay", "other")]
#Run the clustering algo with 2 centers
set.seed(1) #fixes the random starting centers so the run is reproducible and can be compared later
kmeans_obj_Rep = kmeans(clust_data_Rep, centers = 2, 
                        algorithm = "Lloyd")   #<- Lloyd is one of several k-means algorithms that kmeans() implements
#View the results
kmeans_obj_Rep
## K-means clustering with 2 clusters of sizes 225, 202
## 
## Cluster means:
##         aye      nay     other
## 1 122.56889 106.9956  90.43556
## 2  70.32673 145.6337 104.03960
## 
## Clustering vector:
##   [1] 2 2 2 2 1 1 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 1 1 2 1
##  [38] 1 2 1 2 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 1 1 1 2 2 1 2 2 2 2
##  [75] 2 1 1 2 1 1 1 2 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 1 2 1 2 1 1 1 1 1
## [112] 1 2 2 2 1 1 2 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 2 1 1
## [149] 2 2 1 1 1 2 1 2 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 2 1 2 2
## [186] 1 1 1 1 2 1 2 1 2 2 2 1 1 2 2 1 2 2 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 2 1 2 1
## [223] 2 2 2 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 2 2 2 1 1 2 2 2 1 1
## [260] 1 2 1 2 1 2 1 2 1 2 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 2 2 2
## [297] 2 1 2 1 1 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 2 1 1 2 1 1 2 2 1 1 2 2 1 1 1 2 1
## [334] 2 1 1 1 2 2 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1 1 2 2 2 2
## [371] 1 1 1 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 2 1 1 1 1 1 2 1
## [408] 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 43093.49 77671.01
##  (between_SS / total_SS =  79.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
#ratio = 79.5%
#Visualize the output
party_clusters_Rep = as.factor(kmeans_obj_Rep$cluster)


ggplot(house_votes_Rep, aes(x = aye, 
                            y = nay,
                            color = party.labels,  #<- tell R how to color 
                            #   the data points
                            shape = party_clusters_Rep)) + 
  geom_point(size = 6) +
  ggtitle("Aye vs. Nay votes for Republican-introduced bills") +
  xlab("Number of Aye Votes") +
  ylab("Number of Nay Votes") +
  scale_shape_manual(name = "Cluster", 
                     labels = c("Cluster 1", "Cluster 2"),
                     values = c("1", "2")) +
  scale_color_manual(name = "Party",         #<- tell R which colors to use and
                     #   which labels to include in the legend
                     labels = c("Democratic", "Republican"),
                     values = c("blue", "red")) +
  theme_light()
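
The "miscolored points" visible in a plot like this one can also be counted with a cross-tabulation of party label against cluster assignment. The sketch below is only an illustration on made-up data: the `toy` data frame and its values are hypothetical stand-ins for the real vote counts, and `agreement` is a name introduced here.

```r
# Hypothetical toy data standing in for the real aye/nay counts.
set.seed(1)
toy = data.frame(
  party.labels = rep(c("Democrat", "Republican"), each = 20),
  aye = c(rnorm(20, 70, 5),  rnorm(20, 120, 5)),
  nay = c(rnorm(20, 145, 5), rnorm(20, 105, 5)))

km = kmeans(toy[, c("aye", "nay")], centers = 2, algorithm = "Lloyd")

# Rows are parties, columns are clusters; off-pattern cells are the
# "miscolored" members.
confusion = table(party = toy$party.labels, cluster = km$cluster)
confusion

# Share of members who fall in the majority cluster of their own party.
agreement = sum(apply(confusion, 1, max)) / sum(confusion)
agreement
```

Running the same `table()` call on `house_votes_Rep$party.labels` and `party_clusters_Rep` would give the corresponding counts for the real data.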

#Evaluate the quality of the clustering 

# Inter-cluster variance,
# "betweenss" is the between-cluster sum of squares: the squared distances
# from each cluster center to the overall mean, weighted by cluster size.
num_Rep = kmeans_obj_Rep$betweenss

# Total variance, "totss" is the total sum of squares: the squared
# distances from every point in the data set to the overall mean.
denom_Rep = kmeans_obj_Rep$totss

# Variance accounted for by clusters.
(var_exp_Rep = num_Rep / denom_Rep)
## [1] 0.7952692
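
To make the two quantities in this ratio concrete, the following sketch recomputes `totss` and `betweenss` by hand and checks them against what `kmeans()` reports. It runs on hypothetical synthetic data (the `toy` data frame is invented for illustration), not on the house votes file.

```r
# Hypothetical synthetic data with the same three column names.
set.seed(1)
toy = data.frame(aye   = c(rnorm(20, 120, 5), rnorm(20, 70, 5)),
                 nay   = c(rnorm(20, 105, 5), rnorm(20, 145, 5)),
                 other = c(rnorm(20, 90, 5),  rnorm(20, 105, 5)))
km = kmeans(toy, centers = 2, algorithm = "Lloyd")

grand_mean = colMeans(toy)

# totss: squared distance of every point to the overall mean.
totss_manual = sum(sweep(as.matrix(toy), 2, grand_mean)^2)

# betweenss: squared distance of each cluster center to the overall mean,
# weighted by the number of points in that cluster.
betweenss_manual = sum(km$size * rowSums(sweep(km$centers, 2, grand_mean)^2))

all.equal(totss_manual, km$totss)
all.equal(betweenss_manual, km$betweenss)

# The explained-variance ratio used in this section.
betweenss_manual / totss_manual
```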
#Use the function we created to evaluate several different numbers of clusters
explained_variance = function(data_in, k){
  
  # Running the kmeans algorithm.
  set.seed(1)
  kmeans_obj = kmeans(data_in, centers = k, algorithm = "Lloyd", iter.max = 30)
  
  # Variance accounted for by clusters:
  # var_exp = intercluster variance / total variance
  var_exp = kmeans_obj$betweenss / kmeans_obj$totss
  var_exp  
}

explained_var_Rep = sapply(1:10, explained_variance, data_in = clust_data_Rep)
explained_var_Rep
##  [1] -9.867882e-16  7.952692e-01  8.462368e-01  8.743898e-01  8.897348e-01
##  [6]  9.095485e-01  9.283358e-01  9.352083e-01  9.361643e-01  9.435883e-01
#Create an elbow chart of the output 
elbow_data_Rep = data.frame(k = 1:10, explained_var_Rep)

ggplot(elbow_data_Rep, 
       aes(x = k,  
           y = explained_var_Rep)) + 
  geom_point(size = 4) +           #<- sets the size of the data points
  geom_line(size = 1) +            #<- sets the thickness of the line
  xlab('k') + 
  ylab('Inter-cluster Variance / Total Variance') + 
  theme_light()

#looks like 2 clusters is best based on elbow plot
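
One rough way to back up a visual reading of the elbow is to look at how much each additional cluster adds. The sketch below uses the explained-variance values printed earlier in this section; the names `gain` and `k_biggest_jump` are introduced here for illustration, and the "largest jump" rule is only a heuristic, not a formal test.

```r
# Explained-variance ratios for k = 1..10, copied from the output above.
explained_var = c(-9.867882e-16, 7.952692e-01, 8.462368e-01, 8.743898e-01,
                  8.897348e-01, 9.095485e-01, 9.283358e-01, 9.352083e-01,
                  9.361643e-01, 9.435883e-01)

# gain[i] is the improvement from moving from i to i + 1 clusters.
gain = diff(explained_var)
data.frame(k = 2:10, gain = round(gain, 3))

# The k reached by the largest single jump.
k_biggest_jump = which.max(gain) + 1
k_biggest_jump
```

Going from 1 to 2 clusters adds roughly 0.80 to the ratio, while going from 2 to 3 adds only about 0.05, which is consistent with the elbow at 2.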
#Use NbClust to select a number of clusters
(nbclust_obj_Rep = NbClust(data = clust_data_Rep, method = "kmeans"))
## Warning in log(det(P)/det(W)): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 12 proposed 2 as the best number of clusters 
## * 4 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 13 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## $All.index
##          KL        CH Hartigan     CCC    Scott   Marriot     TrCovW    TraceW
## 2  131.0817 1650.8972 140.8739 76.7950      NaN -534.8540 4148081409 120764.50
## 3    0.8084 1166.7437 100.0769 73.2048      NaN -299.2478 2092420152  90700.26
## 4    0.7187  992.4333  54.3510 68.7774 16304.01  253.9222 1364506819  73380.28
## 5    0.3692  851.5327  92.8867 65.0656      NaN   -4.0054 1089970216  65025.23
## 6    0.6653  847.7351 109.7583 64.6344 16862.81  154.3630  711907875  53294.53
## 7   11.4461  906.7565  44.6891 65.8317 17305.83   74.4479  448521362  42273.48
## 8    0.5271  864.2396  44.9357 64.7719      NaN  -36.2018  365507694  38208.04
## 9    2.2064  840.9147  25.6650 64.1975 17740.77   44.4388  300834537  34507.30
## 10   0.4156  794.3214  38.9806 63.0588      NaN   -3.9473  274181075  32511.13
## 11   0.7418  783.7302  40.9588 62.8420      NaN  -65.7978  223035146  29731.84
## 12   0.4787  784.4653  58.9331 62.9414 17615.49  105.9411  184024352  27066.87
## 13   1.2365  824.1301  52.1301 64.0516 18438.62   18.0883  142963370  23701.14
## 14   2.4682  858.4573  32.3669 65.0020      NaN   -3.3429  111468031  21050.50
## 15   0.4565  859.8358  48.2537 65.1549 18991.72    6.5938   96034586  19520.66
##         Friedman    Rubin Cindex     DB Silhouette   Duda  Pseudot2   Beale
## 2  -1.196478e+14 127.4114 0.1605 0.5038     0.7149 1.1814  -31.0231 -0.2603
## 3  -3.098874e+14 169.6442 0.2177 0.7555     0.6854 4.2585 -152.2704 -1.2096
## 4   4.270813e+14 209.6854 0.1694 1.1057     0.4545 0.9574    7.2442  0.0717
## 5  -3.264814e+16 236.6277 0.1605 1.0231     0.4588 1.7872  -52.8556 -0.7360
## 6   8.442332e+14 288.7120 0.1330 1.1476     0.3324 1.7658  -32.9601 -0.7271
## 7   1.498087e+15 363.9817 0.1906 1.0181     0.3467 2.8985  -89.0786 -0.9757
## 8  -3.296079e+15 402.7103 0.1807 1.0080     0.3510 0.9977    0.2089  0.0039
## 9   2.747395e+15 445.8991 0.1743 0.9032     0.3627 0.5647   19.2747  1.2963
## 10 -3.297302e+16 473.2770 0.1938 1.0032     0.3274 6.8536 -116.1566 -1.3908
## 11 -2.061187e+15 517.5183 0.1906 0.9419     0.3232 0.7118   15.7916  0.6801
## 12  1.268548e+15 568.4725 0.1883 0.9347     0.3356 2.3591  -33.4148 -0.9416
## 13  6.598389e+15 649.1998 0.1706 0.9088     0.3436 2.0965  -11.5065 -0.8533
## 14 -3.306221e+16 730.9457 0.1539 0.8893     0.3603 6.8406  -61.4746 -1.3566
## 15  1.651659e+16 788.2302 0.2181 0.9026     0.3542 0.7263   24.1153  0.6289
##    Ratkowsky      Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex  Dindex
## 2     0.5589 60382.252     0.8806  0.5216  0.2779 0.1005      0  0.0763 13.4946
## 3     0.5020 30233.420     0.8967  5.6954  0.2853 0.0782      0  0.0934 12.5274
## 4     0.4533 18345.069     0.6993  2.3087  0.5622 0.0156      0  0.1517 11.0551
## 5     0.4101 13005.045     0.6586  1.5951  0.6556 0.0156      0  0.1620 10.2381
## 6     0.3805  8882.422     0.5582  1.0498  0.9561 0.0156      0  0.1535  9.2055
## 7     0.3580  6039.068     0.5250  1.4422  1.0747 0.0245      0  0.1471  8.4672
## 8     0.3360  4776.005     0.4959  0.8232  1.2010 0.0245      0  0.1573  7.8994
## 9     0.3192  3834.144     0.4855  2.2024  1.2426 0.0245      0  0.1627  7.5574
## 10    0.3019  3251.113     0.4598  2.5479  1.3859 0.0280      0  0.1973  7.2851
## 11    0.2896  2702.895     0.4418  3.0898  1.5030 0.0280      0  0.2265  7.0407
## 12    0.2791  2255.573     0.4241  0.3815  1.6339 0.0280      0  0.2725  6.6572
## 13    0.2688  1823.164     0.4117  0.1275  1.6649 0.0280      0  0.2661  6.3009
## 14    0.2603  1503.607     0.4107 -7.8443  1.5852 0.0280      0  0.2114  5.9222
## 15    0.2516  1301.377     0.3854  0.4523  1.8242 0.0394      0  0.2505  5.7283
##      SDbw
## 2  0.2255
## 3  0.2677
## 4  0.4023
## 5  0.3697
## 6  0.4053
## 7  0.3781
## 8  0.3099
## 9  0.3628
## 10 0.2101
## 11 0.2945
## 12 0.2095
## 13 0.2059
## 14 0.1430
## 15 0.1779
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.6390           114.1248       1.0000
## 3          0.2115           742.0166       1.0000
## 4          0.2887           401.6280       0.9749
## 5          0.4868           126.4976       1.0000
## 6          0.5151            71.5437       1.0000
## 7          0.0438          2971.3276       1.0000
## 8          0.5066            86.6886       0.9997
## 9          0.5413            21.1850       0.2763
## 10         0.3322           273.4259       1.0000
## 11         0.5318            34.3418       0.5650
## 12         0.3500           107.6921       1.0000
## 13         0.3414            42.4447       1.0000
## 14         0.2298           241.3514       1.0000
## 15         0.4783            69.8184       0.5974
## 
## $Best.nc
##                       KL       CH Hartigan    CCC    Scott  Marriot     TrCovW
## Number_clusters   2.0000    2.000   7.0000  2.000  13.0000   5.0000          3
## Value_Index     131.0817 1650.897  65.0692 76.795 823.1304 416.2959 2055661257
##                   TraceW    Friedman    Rubin Cindex     DB Silhouette   Duda
## Number_clusters     3.00 1.50000e+01   7.0000  6.000 2.0000     2.0000 2.0000
## Value_Index     12744.26 4.95788e+16 -36.5411  0.133 0.5038     0.7149 1.1814
##                 PseudoT2   Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters   2.0000  2.0000    2.0000     3.00     3.0000    1  2.0000
## Value_Index     -31.0231 -0.2603    0.5589 30148.83     0.8967   NA  0.2779
##                   Dunn Hubert SDindex Dindex   SDbw
## Number_clusters 2.0000      0  2.0000      0 14.000
## Value_Index     0.1005      0  0.0763      0  0.143
## 
## $Best.partition
##   [1] 2 2 2 2 1 1 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 2 1 2 1 2 2 2 2 1 1 2 1
##  [38] 1 2 1 2 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 2 1 1 1 1 2 2 1 2 2 2 2
##  [75] 2 1 1 2 1 1 1 2 1 1 1 1 2 2 2 2 1 1 1 2 2 2 1 2 2 2 2 1 1 2 1 2 1 1 1 1 1
## [112] 1 2 2 2 1 1 2 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 1 2 1 2 1 1 2 1 2 2 2 2 2 1 1
## [149] 2 2 1 1 1 2 1 2 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 2 1 2 2
## [186] 1 1 1 1 2 1 2 1 2 2 2 1 1 2 2 1 2 2 2 1 2 1 1 1 2 1 1 1 1 2 2 1 1 2 1 2 1
## [223] 2 2 2 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 2 2 2 1 1 2 2 2 1 1
## [260] 1 2 1 2 1 2 1 2 1 2 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 2 2 2
## [297] 2 1 2 1 1 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 2 1 1 2 1 1 2 2 1 1 2 2 1 1 1 2 1
## [334] 2 1 1 1 2 2 2 1 1 1 1 1 1 1 2 1 1 2 2 2 1 1 1 2 1 1 1 1 2 1 2 1 1 2 2 2 2
## [371] 1 1 1 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 2 1 2 2 1 1 1 1 1 2 1
## [408] 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 1 2
# View the output of NbClust.
nbclust_obj_Rep
# View the output that shows the number of clusters each method recommends.
View(nbclust_obj_Rep$Best.nc)
#Display the results visually 
freq_k_Rep = nbclust_obj_Rep$Best.nc[1,]
freq_k_Rep = data.frame(freq_k_Rep)
View(freq_k_Rep)

# Check the maximum number of clusters suggested.
max(freq_k_Rep)
## [1] 15
#essentially resets the plot viewer back to default
#dev.off()

# bar chart of how many indices recommend each number of clusters
ggplot(freq_k_Rep,
       aes(x = freq_k_Rep)) +
  geom_bar() +
  scale_x_continuous(breaks = seq(0, 15, by = 1)) +
  scale_y_continuous(breaks = seq(0, 12, by = 1)) +
  labs(x = "Number of Clusters",
       y = "Number of Votes",
       title = "Cluster Analysis")
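
The same tally can be checked without a plot. The sketch below copies the `Number_clusters` row of `$Best.nc` from the NbClust output above and counts the votes with `table()`. Dropping the proposals of 0 and 1 is an assumption made here so that the counts line up with the printed summary, which lists no votes for those values.

```r
# Number_clusters row of $Best.nc, copied from the NbClust output above.
best_k = c(KL = 2, CH = 2, Hartigan = 7, CCC = 2, Scott = 13, Marriot = 5,
           TrCovW = 3, TraceW = 3, Friedman = 15, Rubin = 7, Cindex = 6,
           DB = 2, Silhouette = 2, Duda = 2, PseudoT2 = 2, Beale = 2,
           Ratkowsky = 2, Ball = 3, PtBiserial = 3, Frey = 1, McClain = 2,
           Dunn = 2, Hubert = 0, SDindex = 2, Dindex = 0, SDbw = 14)

# Assumption: keep only proposals of 2 or more clusters.
votes = table(best_k[best_k >= 2])
votes

# Majority rule: the k with the most votes.
winner = as.integer(names(votes)[which.max(votes)])
winner
```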

#Using the recommended number of clusters, compare the quality of the model 
#with 2 clusters 

#Both the elbow graph and NbClust recommended 2 clusters, so the two models are equivalent.
#Bonus: Create a 3d version of the output
party_color3D_Rep = data.frame(party.labels = c("Democrat", "Republican"),
                               color = c("blue", "red"))

#View(party_color3D_Rep)


# (inner) Join the new data frame to our house_votes_Rep data set.
# An inner join does not require the same dimensions, but both data frames need a column with the same name (here party.labels).
house_votes_color_Rep = inner_join(house_votes_Rep, party_color3D_Rep)
## Joining, by = "party.labels"
#adds the cluster column
house_votes_color_Rep$clusters <- (party_clusters_Rep)

#View(house_votes_color_Rep)

#removes characters that aren't going to be parseable
house_votes_color_Rep$Last.Name <- gsub("[^[:alnum:]]", "", house_votes_color_Rep$Last.Name)

# Use plotly to do a 3d imaging 

fig <- plot_ly(house_votes_color_Rep, 
               type = "scatter3d",
               mode="markers",
               symbol = ~clusters,
               x = ~aye, 
               y = ~nay, 
               z = ~other,
               color = ~color, # ~ is formula notation: plotly looks the name up as a column of the data frame
               colors = c('#0C4B8E','#BF382A'), 
               text = ~paste('Representative:',Last.Name,
                             "Party:",party.labels))


fig

In a separate R Markdown document, work through a similar process with the NBA data (nba2020-21 and nba_salaries_21), merging the two datasets together.

You are a scout for the worst team in the NBA, probably the Wizards. Your general manager just heard about Data Science and thinks it can solve all the team's problems!!! She wants you to figure out a way to find players that are high performing but maybe not highly paid that you can steal to get the team to the playoffs!
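
As a starting point, the merge-then-cluster workflow might look something like the sketch below. Everything in it is hypothetical: the two toy data frames, the column names `Player`, `pts_per_game`, and `salary_musd`, and the choice of 3 clusters are assumptions for illustration and do not reflect the real files.

```r
# Hypothetical toy tables; the real files will have different columns.
stats = data.frame(Player = c("A", "B", "C", "D", "E", "F"),
                   pts_per_game = c(25, 24, 8, 7, 23, 9))
salaries = data.frame(Player = c("A", "B", "C", "D", "E", "F"),
                      salary_musd = c(30, 4, 3, 3, 5, 2))

# Inner join on the shared key column.
nba = merge(stats, salaries, by = "Player")

# Scale first so points and salary contribute on a comparable footing.
nba_scaled = scale(nba[, c("pts_per_game", "salary_musd")])

set.seed(1)
km = kmeans(nba_scaled, centers = 3, nstart = 10)  # 3 is an arbitrary choice
nba$cluster = km$cluster

# Average performance and pay per cluster; a cluster with high points and
# low salary is where to look for bargains.
aggregate(cbind(pts_per_game, salary_musd) ~ cluster, data = nba, FUN = mean)
```

On real data, the number of clusters would be chosen with the elbow plot and NbClust steps used above rather than fixed in advance.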

Details:

Hints:

Submit: the original Rmd and HTML files for both!